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SEMI-PARAMETRIC ITEM RESPONSE FUNCTIONS IN THE CONTEXT OF GUESSING 
Abstract 


We present a logistic function of a monotonic polynomial with a lower asymptote, al- 
lowing additional flexibility beyond the three-parameter logistic model. We develop a 
maximum marginal likelihood based approach to estimate the item parameters. The new 
item response model is demonstrated on math assessment data from a state, and a com- 
putationally efficient strategy for choosing the order of the polynomial is demonstrated. 
Finally, our approach is tested through simulations and compared to response function 
estimation using smoothed isotonic regression. Results indicate that our approach can 
result in small gains in item response function recovery and latent trait estimation. 


Keywords: filtered polynomial, guessing, item response theory 
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1 Introduction 

The three-parameter logistic (3PL; Birnbaum, 1968) model is a commonly used item 
response model that includes a lower asymptote (top panel in Figure 1). This item model 
can be particularly useful in educational assessments when the lower asymptote repre- 
sents the probability of correctly “guessing” the answer to a multiple choice question. 
But, what happens when there is a non-zero probability of guessing and the item re- 
sponse function (IRF) that generated the data does not neatly follow this functional form 
(bottom panel in Figure 1)? 

When the true IRF follows an atypical shape, use of standard item response models 
such as the two-parameter logistic (2PL; Birnbaum, 1968) and generalized partial credit 
models (GPC; Muraki, 1992) can result in poor recovery of the response function, item 
or model misfit, and suboptimal scoring of individuals’ proficiency (e.g., Falk & Cai, in 
press; L. Liang, 2007; L. Liang & Browne, 2015; Ramsay & Abrahamowicz, 1989). We ex- 
pect these same problems to occur when forcing a 3PL model on item response data that 
could have been generated by an IRF with a different functional form. Remedies to this 
problem in general include approaches that allow for more flexibility in the estimated 
IRF. Non-parametric methods may quickly estimate IRFs with good recovery, including 
smoothed isotonic regression (Lee, 2002, 2007) and kernel smoothing (Ramsay, 1991) - 
the latter of which can result in IRFs that are not monotonically increasing. Bayesian 
non-parametric methods may also be employed (Duncan & MacEachern, 2013; Miyazaki 
& Hoshino, 2009; Qin, 1998), but may be slow to estimate (L. Liang, 2007). 

The present research builds upon recent semi- (or quasi-) parametric item response 
functions that are built by replacing the linear predictor of standard response functions 
with a monotonic polynomial (Falk & Cai, in press; L. Liang, 2007; L. Liang & Browne, 
2015). So far, these approaches have been demonstrated to allow more flexibility to the 
2PL and GPC item models, can result in better recovery of IRFs and latent proficiency, 


and perform comparably to kernel smoothing. Specifically, we will show how a logis- 
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Figure 1: Example item response functions 
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Note. The top two panels are response functions fit to math assessment data using the 
3PL item model; the bottom panels are the same items fit with the LMPA model. 
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tic function of a monotonic polynomial with a lower asymptote (LMPA) can allow for 
greater flexibility than the standard 3PL. The LMPA model reduces to the 3PL at the 
lowest-order polynomial. 

We propose that use of the LMPA is possible in conjunction with other standard 
item models (e.g., 2PL, 3PL, GPC) and focus on examples from large-scale educational 
testing. We consider this approach to be potentially appealing and familiar versus a 
fully nonparametric approach in which all IRFs may follow a non-standard shape. In 
addition, we assume that monotonicity of the IRFs is often a desired feature when es- 
timating students’ factor scores. Since IRF estimation utilizing a monotonic polynomial 
has apparently not yet been compared with other nonparametric monotonic approaches, 
through simulations we will compare our modeling approach against smoothed isotonic 
regression (SISO; Lee, 2002, 2007). 

An additional issue we address is the selection of the order of the polynomial for 
each item. Previous research has so far relied mostly on using information criteria (e.g., 
AIC or BIC) for selecting the order of the polynomial, which may require refitting the 
model. For example, Falk and Cai (in press) used an AIC step-wise approach where at 
each iteration, each item in turn was considered as a candidate for being modeled as a 
higher order polynomial. The item that improved AIC the most was selected as having 
a higher-order polynomial, and the process then repeated until no progress could be 
made. When used in conjunction with the EM algorithm (e.g., Bock & Aitkin, 1981), 
this process can be computationally prohibitive with a long test. Thus, the most items 
used in simulations by Falk and Cai (in press) was 20. Use of a different estimation 
approach may be faster computationally (L. Liang, 2007; L. Liang & Browne, 2015), but 
has not been demonstrated for cases where multiple group analyses are employed or 
the dataset contains missing data. As an alternative, we suggest that candidate items 
may be identified without additional model fitting by considering item fit statistics such 


as S — X? (Orlando & Thissen, 2000, 2003). Since S — X? has been successfully used in 
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previous research to identify atypical IRFs, its use has potential in item screening and 
may result in refitting the model a fewer number of times. That is, only items that fit 
poorly according to S — X* may be good candidates for modeling with higher-order 
polynomials. Since it is possible that as part of a testing program poor item fit may be 
used as a flag for deactivating items, this approach also has an advantage if it is able to 
reduce the number of poorly fitting items. 

The remainder of this manuscript is organized as follows. In Section 2, we present 
the proposed item model, LMPA. Section 3 presents an illustration using a large-scale 
state math assessment that uses S — X? to screen items before modeling with the LMPA 
with higher-order polynomials. Section 4 then presents simulation results illustrating 
the ability of the LMPA item model to improve IRF recovery in conjunction with S — X? 
guided polynomial order selection, and its performance against SISO. Finally, Section 5 
contains concluding remarks. 

2 The proposed item response model 
2.1 Logistic function of a monotonic polynomial with asymptote (LMPA) 

To introduce notation, consider 1 = 1,2,...,N examinees complete j = 1,2,...,n 
dichotomously scored items, with observed item responses y;; € [0 1 ]. One way of 
writing the 3PL model for the “correct” response is as follows: 


1G) 


P(1|0;,;,6),7j) = (xj) 4 1+ exp(—(6 + 7/8) 


(1) 


where 6; and ; are the intercept and slopes, respectively. The pseudo-guessing pa- 
rameter, c(kj), determines the lower asymptote, which is an estimate of the proportion 
of examinees in the latent class adopting the “guessing” strategy for this item. This 
pseudo-guessing parameter is reparameterized as a function of x; to allow unconstrained 


estimation: 


(2) 
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To allow one or more “bends” in the 3PL model, we can replace 76; with a monotonic 
polynomial function for item /: 


1—¢(x;) 
1+ exp(—(d; + m(6i, Wj, Xi, T;))) 


P(1(6;,.0;,0;, &;, T;) = c(x;) (3) 


Here the polynomial (omitting the intercept and item subscripts) is of order 2k + 1, 


and its derivative with respect to 6 is of order 2k, 


m(0, Ww, a, v) = m(6,b) _ b10 b> 67 po... 4 bop 02+1 (4) 


m'(6, a) =ag+a,0 4 A607 poses A.6°* (5) 


where k is a user-specified non-negative integer, which may vary across items. Thus, if 
k = 0, the model reduces to the 3PL. To ensure monotonicity of the polynomial (i.e., the 
function increases as @ increases), the derivative is parameterized such that it is always 
positive, which entails the coefficients b = [ b; --- b,4, | are a complicated function 
of the parameters w, «, and t. That the polynomial is monotonically increasing leads 
to a monotonically increasing response function for the correct response. Details of this 
parameterization are given by Falk and Cai (in press), are due much to the hard work of 
others (L. Liang, 2007; Elphinstone, 1985), and appear in Supplementary Materials. 
2.2 Example response functions 

The example response functions in the bottom panel of Figure 1 were based on the 
LMPA model, with k = 1 and k = 2, and their item parameters reported in Table 1. 
These response functions were obtained by analyses conducted on a state math assess- 
ment discussed in Section 3. Note that higher-order polynomials tend to result in more 
flexibility, but could be prone to overfitting noise in the data. 

Non-standard response functions such as these can be the result of a number of 
different reasons. For instance, responses to self-report personality or psychopathology 


rating scales may not follow commonly used parametric forms (Meijer & Baneke, 2004). 
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Figure 2: Mixture of item response functions 
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Table 1: Item parameters from example item response functions 


Parameter Item2 (k=0) Item 24(k=0) Item2(k=2) Item 24 (k = 1) 


K -1.91 -2.14 -1.81 -1.59 
6 -2.19 ae -2.29 80 
Ww .70 -.04 1.14 -.84 
X41 9 A0 
6) 35 

Ty 25 -1.20 
ge) -6.47 


In the context of educational assessments, we propose that data from a heterogeneous 
population could result in non-standard response functions. Suppose data with respect 
to a particular item come from two different unidentified groups that do not differ in 
overall proficiency, but differential item functioning exists for the item. That is, the item 
is easier for Group 1 than Group 2 (see Figure 2). Such a case could happen in practice 
if, for example, the item has unique/specific content that some students were exposed 
to through instruction, whereas other students were not. The data generating model for 


the correct response to this item is then a mix of the two group’s response functions: 


P(1|@) = pi Pi (1|6) + poPo(1|6) (6) 


where p; and p2 = 1 — p; are the proportion of respondents in each group, and P; and 
Py are short-hand for the response functions for Group 1 and 2, respectively. Suppose 
item parameters of « = —1.39 and w = .8 for both groups, but 6 = —4 for Group 1 
and 6 = 4 for Group 2. If we use mixing proportions of 70% and 30% for Groups 1 
and 2, then this results in a response function with a non-standard bend (see Figure 2). 
Although this response function does not come from the LMPA model, the LMPA can 
still provide a good approximation to it. To illustrate, Figure 3 displays the same mixture 


IRF (black line) with the best-fitting 3PL line (blue line; left panel) and LMPA line (blue 
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line; right panel) with only a 3rd-order polynomial (k = 1). Although the LMPA does 
not completely overlap with the mixture IRF, the two are almost indistinguishable. Thus, 
while some approaches do more closely represent the mixing of IRFs or their parameters 
to achieve flexible IRF shapes (e.g., Duncan & MacEachern, 2013; Miyazaki & Hoshino, 
2009) or explicit identification of latent classes (e.g., Rost, 1990), the LMPA approach 
may provide a reasonable alternative solution that can accommodate non-standard IRF 
shapes. Importantly, it may also serve as a leading indicator followed by more detailed 
and computationally-demanding analysis. 

Finally, the same two groups may not necessarily be responsible for non-standard 
IRFs on other items. For example, two other unidentified groups (call these Group A 
and B) whose group membership is not mutually exclusive with Group 1 and 2 could be 
responsible for a non-standard IRF on a second item. Thus, the latent classes involved 
in constructing each non-standard IRF may be different across items, and group mem- 
bership may overlap in a non-trivial way. This could arise in practice if we consider 
students’ exposure to item content due in part to many teachers whose similarities in 
implementation of curriculum overlap in a non-trivial way. 

2.3 Estimation and the use of stabilizing prior distributions 

As Falk and Cai (in press) did for other models including the monotonic polynomial, 
we used the Bock and Aitkin (1981) machinery of maximizing the marginal likelihood 
using the EM algorithm for use with the LMPA approach. Full first- and second-order 
derivatives of the LMPA model necessary for estimation are given in Supplementary Ma- 
terials. Since the LMPA model adds additional parameters to the 3PL, which is known 
to be difficult to estimate, prior distributions were placed on some model parameters to 
provide additional stability. Technically, our approach is Bayesian. It constitutes finding 
the mode of the posterior distribution for model parameters, though we note this ap- 
proach is common in estimation of the 3PL (Bock, Gibbons, & Muraki, 1988; Cai, Yang, 
& Hansen, 2011; Mislevy, 1986). It may be best to understand the effects of these pri- 
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Figure 3: Best fitting 3PL and LMPA functions to mixture IRF 
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Note. Black line is mixture IRF. Left panel includes best fitting 3PL (blue), and right panel 
includes best fitting LMPA line (blue) using k = 1. 

ors as stabilizing soft constraints on gradients and ridging/conditioning constants on 
the Hessian, without completely abandoning the operational efficiencies of maximum 
marginal likelihood. 

Specifically, we used diffuse normal priors on all « and t parameters as also done 
by Falk and Cai (in press), such as (a) ~ N(0,50) and z(t) ~ N’(—1,50). To provide 
stability to the lower asymptote, we use a prior such as 71(«K) ~ MN (—1.39,.25), where 
c(—1.39) = .20, or the rate of guessing we might expect from a five option multiple 
choice question (see also Cai et al., 2011). The final prior used involves that analogous to 
placing a Beta prior on item uniqueness to prevent Heywood cases in item factor analysis 
(e.g., Bock et al., 1988; Mislevy, 1986). We show how to adapt this prior developed under 


multidimensional IRT to the current setting.! 


1A version of the LMPA model that omitted the Beta prior was presented at the 2013 International Meet- 
ing of the Psychometric Society. Omission of the Beta prior, a weaker prior on x, and little troubleshooting 
of starting values tended to yield mediocre performance of the LMPA in simulations. 
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Suppose we derived the LMPA item model using a probit rather than a logistic link 
function. For example, underlying the observed response, yj;;, is a variable that is a 


function of a monotonic polynomial, 


2k+1 
yj = m(6, Aj) +é = y Aq! + €; (7) 
q=1 
We may further assume that ¢; ~ (0, 7), and that 6 ~ N(0,07), usually with 0? = 1. 


Dichotomous responses are produced from yj via: 


Le I Yee Ay 
0, if Yi < 1; 


where 1; is the threshold for item j. The probability of a correct response under this 


model may be written as: 


H 1—c(x;) se r; —m(0,A,) ‘ . 
P*(1|9) Bu ciberrs ath op} 079 (ome) \ (9) 
pone) 


Yj 


= c(kj) + (1 — c(x;))® ( (10) 
where ® is the standard cumulative normal distribution function. This resembles a stan- 
dardized version of the normal ogive model with guessing (e.g., Bock et al., 1988), but 
with the linear predictor replaced by a monotonic polynomial. The goal in remedying 
Heywood cases in this situation involves preventing the unique variance 3 from getting 
too small, or alternatively the steepness of the function implied by m(@,A;) from getting 
too large. If we assume a standard normal distribution for y; and uncorrelated @ and ¢;, 


it follows that 


p; = 1—var(m/(6,A;)) (11) 
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Thus, {7 is not directly estimated, but is a function of m(6,A;). We may also approx- 
imate ~ as a function of the parameters implied by the LMPA: 


2 1 


¥i ~ TP A/D2var(m(,b;) 2) 


where D = 1.702 is the usual scaling constant. Placing a Beta prior on YF, such as 
r( yp") ~ B(p,q) "(pr )P*( - pyr, where B is the Beta function and with q = 1, 
effectively prevents YF from being too small (e.g., Bock et al., 1988), and results in 
changes to the posterior mode for the item parameters in m(6,b;) = m(0,w;,a;,T;). This 
strategy of developing weakly informative priors is not atypical in Bayesian inference, 
where one may choose to motivate the choice by imposing a prior on an alternative 
parameterization of the model (the item unique variance in this case), which then induces 
the desired prior on parameters of interest (the slopes). A typical choice for p is 1.5 (Cai 
et al., 2011), which we employ throughout this paper, and further details as to how 
the Beta prior changes the log-likelihood and derivatives are given in Supplementary 
Materials. 
3 Example: Large-scale state math assessment 

To illustrate the LMPA item model, we utilized data from 10,000 Grade 4 students 
who provided responses to 44 dichotomously scored items on a state math assessment. 
The 3PL model and all LMPA models described in this section used priors as described 
in the previous section. As a preliminary check, use of a 3PL model for all items (AIC 
= 48,5581, BIC = 48,6533) indicated better fit than using a two-parameter logistic model 
for all items (AIC = 48,7290, BIC = 48,7924), and the mean pseudo-guessing parameter, 
c(x), for the 3PL model was approximately .17. We therefore concentrated on the 3PL as 


our baseline model. 
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3.1 Screening of item fit 

The next step was to determine whether the LMPA model may be useful in improving 
model and/or item fit. For this task, we utilized S — X* using the version outlined 
by Orlando and Thissen (2000). This approach was chosen as it is a common way of 
assessing item fit and does not require fitting additional item models to the data. In brief, 
by conditioning on sum-score based groups, the test statistic detects whether observed 
counts of correct/incorrect are congruent with the expected counts based on the model. 
Specifically for item j, 
5 — xX? = yr ny Ona Ei” (13) 
a: e(1 


Ee (1 — Ej) 


where the sum-score groups range from 0,1,...,0 with v being the maximum observed 
sum-score, N; is the number of respondents in sum-score group t, Oj is the observed 
proportion of correct responses, and Ej; is the expected proportion of correct responses. 
Ej, in turn can be computed using the Lord-Wingersky algorithm (Lord & Wingersky, 
1984) as described by Orlando and Thissen (2000). The statistic is compared to a central 
chi-square distribution with degrees of freedom of df = t — 1 —z; where z; is the number 
of estimated parameters for item j. If adjacent sum score groups are collapsed due to 
low expected counts (usually less than 1), additional df adjustments are made. 

The procedure for screening items for misfit for the empirical example, and in simu- 


lations, was as follows, using the baseline (3PL) model as the starting model. 
1. Compute S — X? for all items in the current model. 
2. Flag all items with misfit below a threshold (e.g., p < .05). 


3. For all flagged items (if any), increase the order of the polynomial (e.g., if k = 0, 


change to k = 1; if k = 1 increase to k = 2). 


4. If any items were changed as a result of Step 3, re-fit the model. 
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The Steps 1 through 4 may be repeated as many iterations as desired. For the current 
example and in simulations, we repeated this process only twice, meaning that at most 
the model was fit 3 times (3PL and two cycles of the above steps) and the maximum 
polynomial order for any fitted item was k = 2 (5th order). In addition, for Step 2, it 
is possible to employ either very liberal criteria in screening out items (e.g., no control 
of Type I error or false discovery rate), or to employ a variety of corrective techniques. 
We experimented with using no correction (referred to hereafter as NC), and p-value 
adjustments using the Benjamini-Hochberg procedure (referred to hereafter as BH; Ben- 
jamini & Hochberg, 1995), which is sometimes advocated for use in IRT contexts as a 
high-power alternative to the Bonferroni method (Thissen, Steinberg, & Kuang, 2002). 
While we may expect the NC approach to lead to overfitting, it will have higher power 
and we later examine in simulations whether such overfitting has averse consequences. 
3.2. LMPA Results 

S — X? initially flagged 11 items as poorly fitting under NC and 7 items when using 
the BH correction. Thus, roughly 16% to 25% of items may have poor fit. Flagging and 
fitting items twice under NC resulted in 6 items modeled with k = 1 and 6 items with 
k = 2 using the LMPA model. Under BH, these numbers were predictably lower, with 
5 items as k = 1 and 2 items as k = 2. Both approaches led to fewer items having poor 
fit according to S — X*. For instance, the final NC model had 6 items with poor fit and 
the final BH model had only 2. As shown in Table 2, both final NC and BH models also 
improved AIC, with the NC model indicating slightly better fit. However, BIC actually 
preferred the 3PL model. 

We interpret these results as meaning that there is some promise for the LMPA in re- 
ducing the number of poorly fitting items according to S — X*. The information criterion 
results are mixed in terms of whether the overall final model is an improvement, with 
AIC suggesting an improvement, but a higher penalization for more model parameters 


under BIC suggesting overfitting. It is therefore difficult to guess whether the LMPA is 
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Table 2: Model overview of empirical example 


3PL NC Final Model BH Final Model 


# Parameters 132 168 150 
—2log L 485317 485111 485199 
AIC 485581 485447 485499 


BIC 486533 486659 486580 


Note. 3PL = three parameter logistic; NC = no correction to S — X* p-values; BH = 
Benjamini-Hochberg correction to S — X? p-values. 
substantially improving IRF recovery. We next turn to simulation results to further test 
the potential impact of the LMPA and our modeling procedure. 
4 Simulation 
4.1 Method 

We performed a small simulation study to test the performance of the LMPA IRT 
model and the procedure we employed in using S$ — X? to flag candidate items. Whereas 
the primary goal was to assess IRF recovery of the model versus standard approaches 
(3PL), secondary goals included a comparison of the model versus smoothed isotonic 
regression and recovery of factor scores. Simulations were conducted in R (R Core Team, 
2015) based on programming by the corresponding author. 
4.1.1 Data generation 

For data generation, we used a 2 (% non-standard IRF: 20% vs. 40% of items) x 
2 (N: 1,000 vs. 5,000) overall design. In all cases, we used 40 items, generated latent 
proficiencies, 6, from a standard normal distribution, and performed 100 replications 
per cell of the design. Studied conditions were thus chosen to partially overlap with 
conditions found with the empirical example. 

Most items were 3PL items (LMPA with k = 0) with item parameters (x, w, and 6) ran- 
domly drawn in each replication from a multivariate normal distribution that matched 


the mean and covariance of the all 3PL model in the previous section. To avoid too many 
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extreme items, parameter draws for any item greater than 1.65 SD away from their mean 
were discarded and a new set was drawn. The remaining items (8 or 16 items, depending 
on condition), were randomly constructed from a mixture of normal cumulative distribu- 
tion functions (CDF) and lower asymptotes. Specifically, pi(g¢1 + (1 — 91) ®(0|p1,07)) + 
po(g2 + (1 — g2)®(6|p2,03)) + p3(gs + (1 — g3) (6p, 03)), with p; and pz ~ unif(.2,.4), 
p3 =1—pi— pr fa ~ N(-1.5,.4"), fo ~ N(1.5,.47), 3 ~ N(0,.47), all ¢ parameters 
drawn from unif(.1,.3), and 71,02, and 73 independently drawn from a log-normal distri- 
bution, In.V(—1.03,.22). Here we use ® to denote the cumulative distribution function 
along @ for a given mean and variance. 

4.1.2 Fitted models and factor score estimation 

To each generated dataset, we both mimicked the steps taken in our empirical exam- 
ple, and estimated IRFs using smoothed isotonic regression. That is, we initially fit a 
3PL model, followed by flagging items using NC or BH, and fitting flagged items with 
higher-order polynomials. We repeated re-fitting twice for both NC and BH, resulting 
in final NC and BH models. Thus, we are interested in comparing the 3PL to the final 
NC and BH models. Numerical integration necessary for estimation was done with 49 
equally spaced quadrature points between -5 and 5 across 6. The maximum number of 
M-step iterations was set at 50, and the maximum number of EM cycles was set at 500. 
Model estimation terminated if item parameters from one EM cycle to the next differed 
by .001 or less. 

To further mimic a real data analysis situation, we never started estimation at the 
true model parameters, but rather at x = —1,6 =0,w =0,a =0, and t = —1 for all 
items. However, to automate troubleshooting of estimation problems, we employed the 
following ad-hoc procedure for all replications. Initial starting values for item param- 
eters were further tweaked by using the final estimates of a model run with very few 
maximum M-step iterations and EM cycles (2 and 5, respectively). We then commenced 


with estimation as outlined in the previous paragraph. If the maximum number of EM 
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iterations was reached, or if ridging the Hessian ever failed to yield an invertible matrix, 
new starting values were attempted by using normalized sum scores in the complete- 
data likelihood for each item. The item parameter estimates based on this complete-data 
analysis were then used as starting values. 

For a comparison with our proposed approach, estimation of IRFs via smoothed 
isotonic regression (SISO) was conducted by adapting S-Plus code provided by Lee 
(2002). In brief, isotonic regression is a least squares approach to regression analy- 
sis with the important property that the resulting function is monotonically increas- 


ing. In IRF estimation, given ordered values of the latent trait, 0) < 02 < --- < On, 


the isotonic regression function P*(@) is increasing as a function of the latent traits, 
P*(6,) < P*(@2) < --- < P*(@N). In addition, P*(@) qualifies as the isotonic regression 
function if and only if it minimizes 7, [P(6;) — P (6;)|", where P(@) could be any iso- 
tonic function of 6. In applications of isotonic regression with observed variables, P(6;) 
may take the place of some known value associated with 6;. Although the latent traits 
are unknown, Lee (2002) determines the percentile ranks of @ from observed sum scores 
with ties broken randomly, which may be further transformed onto the hypothesized 
distribution for 6 (standard normal) using the quantile function. P(6;) then becomes the 
observed response (0/1) for respondent i and the function P*(@) that minimizes the sum 
of squares is estimable. 

Since the isotonic regression function, P*(@), often resembles a step function, kernel 
smoothing may additionally be used to provide a smoother IRF shape, which can result 
in better response function recovery (Lee, 2002, 2007). Following Lee (2002, 2007) we 
used a Guassian kernel and based in part on examining the performance of bandwidths 
between .05 and .3 (in .05 increments), we report results based on bandwidths of .2 and 
.1 for N = 1000 and N = 5000 data generating conditions, respectively, as these arguably 
yielded the best performance. Once the SISO function was estimated, linear interpola- 


tion was used if evaluation of the function at a point along @ not already available was 
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required.” For instance, estimation of factor scores under our proposed approach and 
under SISO was conducted using the expected a posteriori (EAP; Bock & Aitkin, 1981; 
Bock & Mislevy, 1982) method using rectangular quadrature across 6 from -5 to 5 in .1 
increments. 

4.2 Results 

Overall it was our hope and expectation that the procedure of using S — X* would 
tend to correctly flag non-standard IRFs (using either NC or BH), which could in turn 
be modelled by the LMPA with higher-order polynomials. This may result in better 
recovery of individual IRFs. 

4.2.1 Ability of S — X? to detect non-standard item response functions 

In general, power to detect non-standard IRFs was slightly less than anticipated. 
Table 3 displays power and false positive rates based on the final NC and BH models 
for all cells in the design. Power is the proportion of items with non-standard IRFs 
that were fitted with the LMPA model with k > 0, whereas the false positive rate is the 
proportion of items with a true 3PL shape fitted by the LMPA model with k > 0. Both 
were computed across all replications in each cell. Thus, power to detect non-standard 
IRFs only reaches a maximum of .36 or .33 for NC and .14 for BH (both when N = 5,000). 
This amounts to correctly detecting approximately 3/8 items for NC and 1/8 items for 
BH when 20% of items have non-standard IRFs, or 5 to 6/16 items for NC and 2/16 
items for BH when 40% of items have non-standard IRFs. 

While we do not immediately have a full explanation, we note that (Orlando & 
Thissen, 2003) found the lowest power for S — X* for items that had a non-standard 
bend or a plateau in the middle of the IRF - visually similar to the items we generated 
- versus items that had non-monotonicity or an omitted asymptote. Since non-standard 


IRFs were generated via a random process, it is not necessarily the case that all items 


Linear interpolation was also used by Lee (2002). In addition, smoothing with the shown bandwidths 
occurs across the percentile ranks for the provisional @ estimates before mapping the smoothed isotonic 
regression functions onto the quantiles for 0. 
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Table 3: S — X? detection rates for the simulation study 


20% non-standard IRFs 40% non-standard IRFs 
N=1,000 N=5,000 N=1,000 N =5,000 


Power 
NC 101 361 111 336 
BH .011 144 .009 144 
False positive 
NC .052 i055 .059 .082 
BH .003 .004 .002 011 


Note. IRF = item response function; NC = no correction to S — X* p-values; BH = 
Benjamini-Hochberg correction to S — X? p-values. 
would in fact look very extreme. This low power rate may not necessarily be problem- 
atic, however, to the extent that both approaches may flag the worst-fitting items and 
flag non-standard items at a higher rate than 3PL items. 
4.2.2 Recovery of item response functions 

We assessed recovery of IRFs by using Root Integrated Mean Square Error (RIMSE; 
e.g., L. Liang, 2007; Ramsay, 1991), defined as the following for item /: 


re 4 (Pi(16,) — Pi(1]6q))2(0,) 1/2 


RIMSE; = 
re, 6(6,) 


x 100 (14) 


where P(1|@) and P(1\@) are short-hand for the estimated and true response function for 
the correct response, ¢ is a standard normal density function, and the sum is across a 
series of Q equally spaced quadrature points along 6 (-5 to 5 in .1 increments).? RIMSE 
thus represents the root of a weighted average squared discrepancy between the true 
and estimated response functions, with more weight given to discrepancies towards the 


middle of the @ distribution. Lower values indicate better IRF recovery. This index 


3Since under our estimation procedure @ was fixed to a standard normal distribution (the same as the 
data generating model), no additional linking procedure was required for assessment of IRF and factor 
score recovery. 
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Table 4: Item response function recovery for replications with items flagged under NC 


# Rep. Overall RIMSE 3PLitems Non-standard items 


N = 1,000 
20% non-standard IRFs 
3PL Model 89 2.67 2.17 4.64 
NC Model 89 2.69 2.24 4.48 
SISO 89 3.54 3.31 4.46 
40% non-standard IRFs 
3PL Model 95 3.16 2.14 4.67 
NC Model 95 3.12 2.21 4.48 
SISO 95 3.71 3.26 4.38 
N = 5,000 
20% non-standard IRFs 
3PL Model 97 1.62 1.03 4.01 
NC Model 97 1.50 1.06 3.23 
SISO 97 2.44 2.22 3.34 
40% non-standard IRFs 
3PL Model 100 2.22 1.04 3.99 
NC Model 100 1.96 1.09 3.26 
SISO 100 2.62 2.19 3.27 


Note. # Rep. = number of replications; RIMSE = Root integrated mean square error; 
3PL = three parameter logistic; NC = no correction to S — X* p-values; SISO = Smoothed 
isotonic regression. 
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was computed for all estimated items and was aggregated across replications and items 
where necessary. 

Since our goal was to compare the LMPA approaches (NC and BH) to the 3PL and 
SISO, we focus only on those replications where S — X? flagged ill-fitting items under NC 
(Table 4) and BH (Table 5). Both of these tables utilized the same simulated data. Overall, 
the NC model slightly improved RIMSE versus the 3PL model across most conditions 
(see Table 4), with gains more prominent at the larger sample size (N = 5000), perhaps 
in part to higher power to flag non-standard IRFs. Examining items where the true 
underlying IRF was either a 3PL or a non-standard item revealed that all of these gains 
were because of improvement in recovering non-standard items. In fact, recovery of 
3PL items was slightly worse for the NC model, though perhaps negligibly so. Overall, 
SISO performed relatively poorly even compared to the all 3PL model. However, closer 
inspection reveals that SISO did well at recovering non-standard items on par with NC, 
but did not recover 3PL items as well. This is not particularly surprising given that the 
3PL and NC models are likely fitting the true item model to the data for many of the 
3PL items. 

Overall, the BH model also slightly improved RIMSE versus the 3PL model across all 
studied conditions (see Table 5), in a similar pattern to NC. Gains in RIMSE were also 
due to better modeling of non-standard items. Albeit based on a much smaller number 
of replications, use of BH appeared to have no averse impact on the recovery of 3PL item 
IRFs, in slight contrast to using NC. Overall recovery for SISO was not as good as the 
3PL and BH models. The pattern was such that SISO did not recover 3PL items as well, 
but slightly outperformed the other approaches in recovery of non-standard items. 

4.3 Recovery of factor scores 


Recovery of factor scores was assessed using root mean square error (RMSE): 


N 1/2 
RMSE = (w \ (6 - a) x 100 (15) 
i=1 
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Table 5: Item response function recovery for replications with items flagged under BH 


# Rep. Overall RIMSE 3PLitems Non-standard items 


N = 1,000 
20% non-standard IRFs 
3PL Model 15 2.71 2.18 4.84 
BH Model 15 2.66 2.19 4.54 
SISO 15 3.49 3.26 4.40 
40% non-standard IRFs 
3PL Model 16 3.18 2.16 4.71 
BH Model 16 3.13 2.18 4.56 
SISO 16 3.73 3.25 4.45 
N = 5,000 
20% non-standard IRFs 
3PL Model 67 1.65 1.02 4.14 
BH Model 67 1.53 1.02 3.58 
SISO 67 2.46 2.22 3.39 
40% non-standard IRFs 
3PL Model 89 2.23 1.04 4.01 
BH Model 89 2.04 1.03 3.55 
SISO 89 2.62 2.19 3.27 


Note. # Rep. = number of replications; RIMSE = Root integrated mean square error; 3PL 
= three parameter logistic; BH = Benjamini-Hochberg correction to S — X* p-values; SISO 
= Smoothed isotonic regression. 
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where 6; is the EAP score for simulee i and @ is the actual simulated value. 

As with IRF recovery, factor score recovery is reported only for replications involving 
ill-fitting items according to NC and BH in Tables 6. Across both sets of results, NC 
and BH tended to result in improved factor score recovery, with the lone exception with 
the smallest sample size (N = 1000) and fewest non-standard items (20%) where NC 
(36.93) was nearly identical to the all 3PL model (36.91). However, it could be argued 
that such gains by NC and BH were very small. For instance, NC and BH improved 
recovery over the 3PL by at most .28 and .2, respectively, both at the largest sample size 
(N = 5000) and with the most non-standard items (40%). The performance of SISO 
in factor score recovery was relatively worse than other methods in all cases, except at 
the largest sample size (N = 5000) and number of non-standard items (40%) where it 
slightly outperformed the 3PL model. 

5 Discussion 

We have presented a new IRT model, the logistic function of a monotonic polynomial 
with asymptote (LMPA), that can account for guessing as does the 3PL, but has a more 
flexible IRF shape. We proposed a possible data generating mechanism that may yield 
such non-standard IRFs - heterogeneous populations whose IRFs are mixed together 
in generating the item responses. We tested use of S — X? to flag candidate items for 
use with the LMPA model in an empirical example and through simulations. Finally, 
our approach was also compared against both the 3PL model and smoothed isotonic 
regression. 

Our empirical example suggests that use of S — X? and LMPA can improve the num- 
ber of well-fitting items without it being necessary to employ a computationally intensive 
step-wise approach utilizing AIC, BIC, or likelihood ratio tests. For instance, the num- 
ber of poorly fitting items was reduced by approximately half or better, depending on 
whether NC or BH p-values under S — X* were examined. Such initially poor-fitting 


items may be those that show instructional sensitivity in cases where curriculum and 
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Table 6: Factor score recovery for replications with items flagged under NC and BH 


NC BH 
# Rep. RMSE #Rep. RMSE 

N = 1,000 
20% non-standard IRFs 

3PL Model 89 36.91 15 37.44 

NC Model 89 36.93 

BH Model 15 37.36 

SISO 89 37.56 15 37.94 
40% non-standard IRFs 

3PL Model 95 36.64 16 36.41 

NC Model 95 36.61 

BH Model 16 36.34 

SISO 95 37.00 16 36.67 
N = 5,000 
20% non-standard IRFs 

3PL Model 97 36.62 67 36.59 

NC Model 97 36.46 

BH Model 67 36.45 

SISO 97 36.92 67 36.88 
40% non-standard IRFs 

3PL Model 100 36.29 89 36.26 

NC Model 100 36.01 

BH Model 89 36.06 


SISO 100 36.23 89 36.20 


Note. # Rep. = number of replications; RMSE = Root mean square error; 3PL = three 
parameter logistic; BH = Benjamini-Hochberg correction to $ — X? p-values; NC = no 
correction to S — X? p-values; SISO = Smoothed isotonic regression. 
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instruction is not implemented in a fully standard way across the entire population. Our 
procedure thus serves the dual purpose of identifying potential items, and allowing re- 
tention of such items (which can be expensive to develop) in a large-scale assessment if 
practice dictates that poor fitting items are flagged for possible deactivation. 

Our simulation results suggest that the procedure utilized on our empirical example 
is sound and can lead to better IRF recovery, though only very small gains in factor score 
recovery. The performance of our approach was especially better with a higher sample 
size, and where more items had true non-standard IRFs. Allowing the LMPA item model 
to be mixed with other item types such as the 3PL allows for parsimony and resulted in 
better overall IRF and factor score recovery than SISO in our simulations. However, we 
note that SISO performed on par and sometimes better at recovering non-standard items, 
perhaps in part to the low power of S — X? in detecting candidate items for use with the 
LMPA item model. It remains possible that the presence of a higher percentage of non- 
standard items on a test could lead to more comparable overall performance between 
both LMPA approaches and SISO. 

It is worth noting that although we assume that monotonic IRFs are desired as this is 
the most likely case in large-scale testing programs, estimation of IRFs can be performed 
by both nonparametric and semi-parametric methods that allow for nonmonotonicity. 
For example, kernel smoothing (Ramsay, 1991) could be employed, or the constraints 
enforcing monotonicity of the LMPA could be released by reparameterizing (e.g., see 
Falk & Cai, in press). In retrospect, it may also be worthwhile to explore alternative 
ways of screening for poorly fitting items as some of these may have higher power than 
S — X* and involve fitting nonparametric approaches that can reveal nonmonotonicity 
(Douglas & Cohen, 2001; T. Liang & Wells, 2015; Wells & Bolt, 2008). 

Finally, although use of S — X? for identifying candidate items for the LMPA model 
may not result in the best IRF recovery possible, we argue that our goal is not to nec- 


essarily find the best solution, but to merely improve item fit and IRF recovery in a 
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computationally efficient way. Our examples focus on tests with less than fifty items, yet 
we note that many testing programs may use hundreds or thousands of items, which 
can make finding the most optimal solution unfeasible if coupled with any step-wise 
approach. Thus, we close with the observation that our approach can be used in such 
situations (unlike a step-wise approach), but that further room for improvement is pos- 
sible in determining the order of the polynomial for the LMPA and other monotonic 


polynomial item models. 
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